First, load the packages:

```r
library(MASS)
library(class)
library(ISLR)
library(tidyverse)
library(ggplot2)  # already attached by tidyverse, but loaded explicitly here
```

Set the seed:

```r
set.seed(123)
```

Create a scatterplot of the `Default` dataset, where `balance` is mapped to the x position, `income` is mapped to the y position, and `default` is mapped to the colour. Can you see any interesting patterns already?

It seems that people with a higher average balance tend to default on their debt.
```r
# check the data
glimpse(Default)
```

```
Rows: 10,000
Columns: 4
$ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
$ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
$ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
$ income  <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
```

```r
Default %>% ggplot(aes(x = balance, y = income, color = default)) +
  geom_point(alpha = 0.5, size = 1)
```

Add `facet_grid(cols = vars(student))` to the plot. What do you see?

The student group has lower income.
```r
Default %>% ggplot(aes(x = balance, y = income, color = default)) +
  geom_point(alpha = 0.5, size = 1) +
  facet_grid(cols = vars(student))
```

Transform `student` into a dummy variable using `ifelse()` (0 = not a student, 1 = student). Then, randomly split the `Default` dataset into a training set `default_train` (80%) and a test set `default_test` (20%).

```r
Default <- Default %>%
  mutate(student_dum = ifelse(student == "No", 0, 1),
         split = sample(c("train", "test"), size = nrow(.), replace = TRUE, prob = c(0.8, 0.2)))
default_train <- Default %>% filter(split == "train")
default_test  <- Default %>% filter(split == "test")
```

Create class predictions using the `knn()` function. Use `student`, `balance`, and `income` (but no basis functions of those variables) in the `default_train` dataset. Set `k` to 5. Store the predictions in a variable called `knn_5_pred`.

```r
knn_5_pred <- knn(train = default_train[c("student_dum", "balance", "income")],
                  test  = default_test[c("student_dum", "balance", "income")],
                  cl    = as.factor(default_train$default),
                  k     = 5)
# The train and test data frames passed to knn() cannot contain a factor
# variable such as `student`, so we use the numeric dummy `student_dum` instead.
```

Create two scatterplots with `income` and `balance` as before: one with the true class (`default`) mapped to the colour aesthetic, and one with the predicted class (`knn_5_pred`) mapped to the colour aesthetic. Hint: add the predicted class `knn_5_pred` to the `default_test` dataset before starting your `ggplot()` call for the second plot. What do you see?
```r
default_test %>% ggplot(aes(x = balance, y = income, color = default)) +
  geom_point(alpha = 0.5, size = 1)

default_test %>%
  mutate(predicted = knn_5_pred) %>%
  ggplot(aes(x = balance, y = income, color = predicted)) +
  geom_point(alpha = 0.5, size = 1)
```

Repeat the same steps, but now with a `knn_2_pred` vector generated from a 2-nearest-neighbours algorithm. Are there any differences?

```r
knn_2_pred <- knn(train = default_train[c("student_dum", "balance", "income")],
                  test  = default_test[c("student_dum", "balance", "income")],
                  cl    = as.factor(default_train$default),
                  k     = 2)

default_test %>%
  mutate(predicted2 = knn_2_pred) %>%
  ggplot(aes(x = balance, y = income, color = predicted2)) +
  geom_point(alpha = 0.5, size = 1)
```

If the classification were perfect, the off-diagonal entries of the confusion matrix would be zero.

The 5-NN model has more false negatives, while the 2-NN model has many more false positives. Overall, the 5-NN model has better prediction accuracy.
```r
# 5-NN model
t5 <- table(true = default_test$default, knn_5_pred); t5
```

```
     knn_5_pred
true    No  Yes
  No  1873    7
  Yes   58   11
```

```r
acc5 <- sum(diag(t5)) / sum(t5); acc5
```

```
[1] 0.9666496
```

```r
# 2-NN model
t2 <- table(true = default_test$default, knn_2_pred); t2
```

```
     knn_2_pred
true    No  Yes
  No  1838   42
  Yes   45   24
```

```r
acc2 <- sum(diag(t2)) / sum(t2); acc2
```

```
[1] 0.9553617
```
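Since k-NN is distance-based, predictors on very different scales (here `income` runs into the tens of thousands while `student_dum` is 0/1) can dominate the distance computation. Below is a minimal sketch of standardising the predictors before calling `knn()`, using toy stand-in data so the chunk is self-contained; the column names mirror the exercise, and whether standardising actually improves accuracy on `Default` would need to be checked:

```r
library(class)  # for knn()

set.seed(123)
# toy stand-ins for default_train / default_test with the same columns
tr <- data.frame(student_dum = rbinom(200, 1, 0.3),
                 balance     = runif(200, 0, 2500),
                 income      = runif(200, 10000, 70000),
                 default     = factor(sample(c("No", "Yes"), 200, replace = TRUE)))
te <- tr[1:50, c("student_dum", "balance", "income")]

vars <- c("student_dum", "balance", "income")
# standardise the test set with the *training* means and sds to avoid leakage
mu  <- sapply(tr[vars], mean)
sds <- sapply(tr[vars], sd)
tr_s <- scale(tr[vars], center = mu, scale = sds)
te_s <- scale(te,       center = mu, scale = sds)

knn_5_scaled <- knn(train = tr_s, test = te_s, cl = tr$default, k = 5)
```

With unscaled data the neighbourhoods are driven almost entirely by `income`; after standardising, each predictor contributes comparably to the Euclidean distance.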
Use `glm()` with argument `family = binomial` to fit a logistic regression model `lr_mod` to the `default_train` data.

```r
lr_mod <- glm(default ~ balance + income + student_dum, family = binomial, data = default_train)
```

Visualise the predicted probabilities versus the observed class for the `default_train` data in `lr_mod`. You can choose for yourself which type of visualisation you would like to make. Write down your interpretations along with your plot.

```r
train_df <- data.frame(pred = predict(lr_mod, type = "response"),
                       obs  = default_train$default)
train_df %>% ggplot(aes(x = obs, y = pred)) +
  geom_jitter(color = alpha("skyblue", 0.7), size = 1) +
  theme_minimal()
```

Look at the coefficients of the `lr_mod` model and interpret the coefficient for `balance`. What would the probability of default be for a person who is not a student, has an income of 40000, and a balance of 3000 dollars at the end of each month? Is this what you expect based on the plots we've made before?

The coefficient for `balance` is about 0.006, so \(e^{0.006} \approx 1.006\). Each one-dollar increase in balance is expected to multiply the odds of defaulting by about 1.006.

The probability of defaulting is 0.998, which makes sense based on the previous plots: this new data point would fall on the far right side of the plot, where observations are predicted to default.
```r
betas <- coef(lr_mod)
# coefficient for `balance`
exp(betas['balance'])
```

```
 balance 
1.005973 
```

```r
# log-odds
logodds <- betas['(Intercept)'] + 0 * betas['student_dum'] + 40000 * betas['income'] + 3000 * betas['balance']
logodds <- unname(logodds)  # drop the names
# probability
exp(logodds) / (1 + exp(logodds))
```

```
[1] 0.9987839
```
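The manual `exp(logodds) / (1 + exp(logodds))` step can be cross-checked with base R's `plogis()`, which is exactly this logistic transformation. A small self-contained sketch (the log-odds value below is chosen for illustration; the fitted model is not reproduced here):

```r
# plogis(x) is the logistic CDF: exp(x) / (1 + exp(x))
logodds <- 6.71  # roughly the value produced by the calculation above
manual  <- exp(logodds) / (1 + exp(logodds))
all.equal(manual, plogis(logodds))  # TRUE

# with the fitted model, the same probability comes directly from, e.g.:
# predict(lr_mod, newdata = data.frame(student_dum = 0, balance = 3000,
#                                      income = 40000), type = "response")
```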
Create a data frame called `balance_df` with 3 columns and 500 rows: `student` always 0, `balance` ranging from 0 to 3000, and `income` always the mean income in the `default_train` dataset.

```r
balance_df <- data.frame(
  student_dum = rep(0, 500),  # following the dummy-variable naming used above
  balance     = seq(0, 3000, length.out = 500),
  income      = rep(mean(default_train$income), 500)
)
```

Use this data frame as `newdata` in a `predict()` call using `lr_mod` to output the predicted probabilities for different values of `balance`. Then create a plot with the `balance_df$balance` variable mapped to x and the predicted probabilities mapped to y. Is this in line with what you expect?

```r
pred_prob <- predict(lr_mod, newdata = balance_df, type = 'response')
gg_df <- data.frame(obs  = balance_df$balance,
                    pred = pred_prob)
gg_df %>% ggplot(aes(x = obs, y = pred)) +
  geom_point()
```

Create a confusion matrix for the logistic regression predictions on the test set, as for the k-NN models. Does logistic regression perform better?

Yes, logistic regression performs better, given that its prediction accuracy is higher.
```r
pred_prob <- predict(lr_mod, newdata = default_test, type = "response")
lr_pred   <- ifelse(pred_prob > 0.5, "Yes", "No")
lr_t      <- table(true = default_test$default, lr_pred)
lr_acc    <- sum(diag(lr_t)) / sum(lr_t); lr_acc
```

```
[1] 0.9697281
```
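The same trace-over-total computation recurs for every model, so a small helper function (a hypothetical `acc_from_table`, not part of the exercise) avoids repeating it. Reconstructing the 5-NN confusion matrix reported earlier shows that it reproduces the same accuracy:

```r
# hypothetical helper: accuracy = sum of the diagonal / total count
acc_from_table <- function(conf) sum(diag(conf)) / sum(conf)

# the 5-NN confusion matrix reported earlier
t5 <- as.table(matrix(c(1873, 58, 7, 11), nrow = 2,
                      dimnames = list(true = c("No", "Yes"),
                                      pred = c("No", "Yes"))))
round(acc_from_table(t5), 7)  # 0.9666496
```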
Train an LDA classifier `lda_mod` on the training set.

```r
lda_mod <- lda(default ~ balance + income + student_dum, data = default_train)
```

Look at the `lda_mod` object. What can you conclude about the characteristics of the people who default on their loans?

From the group means below, defaulters have a much higher average balance (about 1772 vs. 805), a slightly lower average income, and are more often students (about 40% vs. 29%) than non-defaulters.
```r
lda_mod
```

```
Call:
lda(default ~ balance + income + student_dum, data = default_train)

Prior probabilities of groups:
        No        Yes 
0.96720904 0.03279096 

Group means:
      balance   income student_dum
No   804.8662 33549.69   0.2916399
Yes 1771.5911 31821.97   0.3977273

Coefficients of linear discriminants:
                     LD1
balance     2.240343e-03
income      2.843076e-06
student_dum -1.451089e-01
```
```r
pred_lda <- predict(lda_mod, newdata = default_test)
lda_t    <- table(true = default_test$default, predicted = pred_lda$class)
lda_acc  <- sum(diag(lda_t)) / sum(lda_t); lda_acc
```

```
[1] 0.9702412
```
Create a model to predict whether a 14-year-old boy from the 3rd class would have survived the Titanic disaster, using the data in the `data/` folder. Would the passenger have survived if they were a girl in 2nd class?

The boy is likely to die (\(p_0 = 0.91\)) and the girl is likely to survive (\(p_1 = 0.74\)).
```r
# load data
titanic <- read.csv("data/Titanic.csv")
# check data
glimpse(titanic)
```

```
Rows: 1,313
Columns: 5
$ Name     <chr> "Allen, Miss Elisabeth Walton", "Allison, Miss Helen Loraine"…
$ PClass   <chr> "1st", "1st", "1st", "1st", "1st", "1st", "1st", "1st", "1st"…
$ Age      <dbl> 29.00, 2.00, 30.00, 25.00, 0.92, 47.00, 63.00, 39.00, 58.00, …
$ Sex      <chr> "female", "female", "male", "female", "male", "male", "female…
$ Survived <int> 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1…
```

```r
# LDA, dropping the Name column
lda_mod_titanic <- lda(Survived ~ ., data = titanic[, -1])
```
```r
# prediction for a 14-year-old boy and girl
# (note: both passengers are coded as 3rd class here; to answer the
# girl-in-2nd-class question, use PClass = c("3rd", "2nd") instead)
pred_lda_titanic <- predict(lda_mod_titanic,
                            newdata = data.frame(PClass = rep("3rd", 2),
                                                 Age    = rep(14, 2),
                                                 Sex    = c("male", "female")))
pred_lda_titanic$posterior
```

```
          0          1
1 0.9102243 0.08977569
2 0.2636100 0.73638996
```